Description

METHOD OF CALCULATING FREQUENCY OF SEQUENCE,

METHOD OF CALCULATING DEGREE OF ISOLATION, AND
METHOD OF ESTIMATING DEGREE OF A...

Technical Field to which the Invention Belongs 

The present invention relates to a method for supporting 
primer selection. 

Background Art 

While many primer design methods have been proposed in
the past, it is currently difficult to design a primer which
anneals in only one place. By calculating incidences of all
possible alkali arrays (k-tuples) that are shorter than the EST
arrays registered in a database, for example, by calculating
incidences of the 4^8 (65536) kinds of 8-mer alkali arrays,
arrays with a high incidence and arrays with a low incidence
can be found. This kind of method is disclosed in, for example,
"Nucleic Acids Res. 19, 3887-3891 (R. Griffais, P.M. Andre and
M. Thibon: 1991)".

However, since several well-known EST databases contain
many similar arrays contributed by many researchers, the
incidences of the arrays cannot be used as they are.

In order to design a primer sandwiching genes, an array
in a promoter region is required. Therefore, the primer cannot
be designed only with an EST database, which is a problem.

It is known that DNA polymerase can recognize an
oligonucleotide as a primer even when the oligonucleotide has
several mismatches (refer to, for example, "Molecular Biology
Vol. 28, No. 5, Part 1, 661-663 (L.B. D'Yachenko, A.A.
Chenchick, G.L. Khaspekov, A.O. Tatarenko and R.Sh.
Bibilashvili: 1994)"). However, primer design methods proposed
in the past do not consider genome-wide mismatch tolerance.
Furthermore, in past primer design methods, the mismatch
tolerance is searched for in a database after the alkali array
of a given primer is determined. Therefore, the search takes
time, which is another problem.

It is an object of the invention to support unique primer
design.

Disclosure of the Invention 

According to the invention, an incidence of an array
with a predetermined length (N-mer) in a genome array is
counted, and the array is evaluated by introducing an isolation
degree, which is another aspect of array uniqueness, as a value
for evaluating the mismatch tolerance. The isolation degree is
defined as a minimum Hamming distance between arrays, for
example. By introducing the isolation degree, the uniqueness
of an alkali array can be categorized more precisely.

More specifically, the object of the invention can be
achieved by a method for calculating an indicator indicating
an incidence of an array in a genome array, the method
characterized by the steps of calculating incidences of
partial arrays with a predetermined length in the genome array,
and storing the incidences relating to the partial arrays with
the predetermined length in an incidence table.

The step of storing in the incidence table desirably has
the steps of omitting the storage into the incidence table for
partial arrays with an incidence of zero (0), and, by using
second partial arrays having a second predetermined length
shorter than the predetermined length, storing in a second
table the position in the incidence table of the partial arrays
with the predetermined length that include the second partial
array at the beginning. Thus, the memory capacity and
processing time can be reduced.

The object of the invention can be achieved by a method
for calculating an indicator indicating an isolation degree
of an array in a genome array, the method characterized by
including the steps of calculating an isolation degree i by
which j mutation(s) (j = 1, 2, ..., i-1), referring to the
conversion of j alkali(s) of each of partial arrays with a
predetermined length, do/does not appear in the genome array
but i mutation(s), referring to the conversion of i alkalis,
appear(s) in the genome array; and storing in an isolation
degree table the isolation degree with respect to the partial
arrays with the predetermined length.

A unique part can be identified easily in a genome array
by using the incidence and/or isolation degree, and a more
unique primer can be designed.

According to a preferred embodiment, the step of
calculating the isolation degree has the steps of judging
whether or not k mutation(s), referring to the conversion of
k alkali(s) of the partial array with the predetermined length,
exist(s) among the partial arrays with the predetermined length
with reference to an incidence table storing an incidence in
a genome array with respect to each of the partial arrays with
the predetermined length; when the k mutation(s) exist(s),
determining k as the isolation degree; and when the k
mutation(s) does/do not exist, incrementing k and repeating
the step of judging the presence of the k mutation(s).

According to another preferred embodiment, the step of
calculating the isolation degree has the steps of judging, by
using second partial arrays having a second predetermined
length shorter than the predetermined length and with
reference to a second table storing a position, in the
incidence table, of the partial arrays with the predetermined
length including the second partial array from the beginning,
whether the k mutation(s) with the predetermined length
exist(s) in which k alkali(s) at a position away from the
beginning of the partial array with the predetermined length
by the second predetermined length is/are converted; when the
k mutation(s) exist(s), finding a Hamming distance between the
k mutation(s) and the array with the predetermined length;
when the minimum value of the Hamming distance is k,
determining the k as the isolation degree thereof; and when the
minimum value is larger than k, incrementing k and repeating
the step of judging by using the presence of the k mutation(s)
with the predetermined length and the minimum value of the
Hamming distance.

According to another preferred embodiment, the method
includes the step of judging the appearance in the genome array
based on whether the incidence in the genome array is equal
to or lower than n. When a genome array is not organized or
when a same genetic array actually appears in a genome array
only twice, an isolation degree extended based on whether the
incidence is equal to or lower than three or not (second
isolation degree) is obtained. Thus, primer design can be
achieved for a partial array which appears in a genome array
three times or fewer but is similar to no other arrays in the
genome array.

According to another preferred embodiment, a method for
calculating an indicator indicating an isolation degree of a
genome array includes the steps of calculating a shortest
partial array by which a partial array starting from the k-th
letter of a partial array with a predetermined length no longer
appears in a genome array, and calculating the maximum number
m of partial arrays uniquely included in the partial array and
handling the m as an indicator indicating an isolation degree
thereof by considering the m as the lower bound of the
isolation degree.

The absence of similar arrays is assured by the lower
bound of the isolation degree for primer selection instead of
an accurate isolation degree of a longer array (such as a
50-mer array in a human genome array). For example, when a
lower bound of the isolation degree is "7", arrays having 90%
similarity or more do not exist in a 50-mer array. Thus, the 
absence of an array having 60% similarity or more does not have 
to be proved accurately. The knowledge of the absence of 
arrays having 90% similarity or more is enough as an indicator 
for the primer selection. 

In the embodiment, the step of judging whether the 
partial array appears or not may be performed based on whether 
the incidence in the genome array is equal to or lower than 
n. 

The object of the invention can be also achieved by a
method for calculating a first indicator indicating an
eligibility for a primer of an array including a given alkali
with respect to alkalis in a genome array by using an incidence
table created by using the method, characterized by including
the steps of identifying a same number of arrays including the
alkali as a predetermined length with respect to each of
alkalis included in a genome array, identifying an incidence
relating to each of the identified arrays with reference to
the incidence table, and calculating the first indicator based
on a total sum of the identified incidences.

The object of the invention can be also achieved by a
method for calculating a second indicator indicating an
eligibility for a primer of an array including a given alkali
with respect to alkalis in a genome array by using an isolation
degree table created by using the method, characterized by
including the steps of identifying a same number of arrays
including the alkali as a predetermined length with respect
to each of alkalis included in a genome array, identifying an
isolation degree relating to each of the identified arrays
with reference to the isolation degree table, and calculating
the second indicator based on a total sum of the identified
isolation degrees.

The object of the invention can be also achieved by a
method for calculating a third indicator indicating an
eligibility, for a primer, of an array including a given alkali
with respect to alkalis in a genome array by using an incidence
table and isolation degree table created by using the method,
characterized by including the steps of identifying a same
number of arrays including the alkali as a predetermined
length with respect to each of alkalis included in a genome
array, identifying an incidence relating to each of the
identified arrays with reference to the incidence table,
calculating a first indicator based on a total sum of the
identified incidences, identifying an isolation degree
relating to each of the identified arrays with reference to
the isolation degree table, and calculating a second indicator
based on a total sum of the identified isolation degrees.

By using these methods, indicators at an alkali level 
in a genome array can be obtained, and design of a more unique 
primer can be supported. 

The object of the invention is also achieved by a method
characterized by including the steps of assigning, based on
an indicator obtained by using the method, a different display
form in accordance with a value or range of the indicator, and
creating an image representing each alkali in a genome array
in accordance with the assigned display form. For example, the
display form may be a color.

The object of the invention can be also achieved by a
program for operating a computer for calculating an indicator
indicating an incidence of an array in a genome array and being
readable by the computer, the program causing the computer to
perform the steps of calculating incidences of partial arrays
with a predetermined length in the genome array, and storing
the incidences relating to the partial arrays with the
predetermined length in an incidence table.

The object of the invention can be also achieved by a
program for operating a computer for calculating an indicator
indicating an isolation degree of an array in a genome array
and being readable by the computer, the program causing the
computer to perform the steps of calculating an isolation
degree i by which j mutation(s) (j = 1, 2, ..., i-1), referring
to the conversion of j alkali(s) of each of partial arrays with
a predetermined length, do/does not appear in the genome array
but i mutation(s), referring to the conversion of i alkalis,
appear(s) in the genome array, and storing in an isolation
degree table the isolation degree with respect to the partial
arrays with the predetermined length.

Brief Description of the Drawings 

Fig. 1 is a block diagram illustrating an overview of
a primer design support system according to an embodiment of
the invention.

Fig. 2A is a diagram for describing incidences of genome
arrays according to the embodiment; and

Fig. 2B is a diagram for describing a first indicator
based on incidences.

Fig. 3 is a diagram for describing an isolation degree
according to the embodiment.

Fig. 4 is a flowchart illustrating processing for an
incidence calculation according to the embodiment.

Figs. 5A to 5C are diagrams for describing tables
relating to incidences according to the embodiment.

Fig. 6 is a flowchart illustrating processing for
isolation degree calculation according to the embodiment.

Fig. 7 is a diagram for describing colors to be assigned
in visualization processing according to the embodiment.

Fig. 8 is a graph showing a maximum height of a human
genome array, which is calculated according to a second
embodiment.

Fig. 9 is a flowchart showing an overview of processing 
to be performed in a design support apparatus according to the 
second embodiment. 

Preferred Mode for Carrying Out the Invention 

Embodiments of the invention will be described below 
with reference to attached drawings. Fig. 1 is a block diagram 
illustrating an overview of a primer design support system 
according to an embodiment of the invention. As shown in Fig. 
1, the primer design support system 10 has an incidence 
calculator portion 12, an isolation degree calculator portion 
14, an incidence/isolation degree table 16, a visualization 
processing portion 18, and a primer creation supporting 
portion 20. The incidence calculator portion 12 calculates
an incidence of a partial array having a predetermined length
(N-mer array) in a given genome array with reference to the
genome array to be used for primer design. The isolation
degree calculator portion 14 calculates an isolation degree
of each partial array with respect to the genome array as
described later. The incidence/isolation degree table 16
stores an incidence and isolation degree relating to each
partial array. The visualization processing portion 18
performs processing required for visualizing and displaying
the genome array with reference to the incidence/isolation
degree table 16. The primer creation supporting portion 20
performs processing of selecting an area to be used as a primer
with reference to an image displayed on a screen of a display
apparatus 24 by the processing by the visualization processing
portion 18 in response to a manipulation on an input apparatus
(not shown) by a user.

The primer design support system 10 can be implemented
by installing a design support program on a personal computer.
According to this embodiment, a genome array is read from a
genome array database (DB) 22. The genome array DB 22 may be
on a hard disk of the personal computer or may be located on
a server remote from the personal computer. In the latter case,
the personal computer may access the server over a network such
as a LAN or the Internet and refer to data in the genome array
DB.

Before describing processing by the primer design
support system 10, a principle of the invention will be briefly
described below.

Fig. 2A is a diagram for describing incidences of a
genome array. Here, an extremely short genome array,
"ATATGGGATC", is used, and 2-mer arrays are considered as
partial arrays thereof. As shown in Fig. 2A, a 2-mer array,
"AT", appears three times in the genome array, and a 2-mer
array, "GG", appears twice in the genome array. Other 2-mer
arrays ("TA", "TG", "GA", and "TC") appear once, and still
other arrays (such as "AA" and "AC") do not appear.

In this way, the incidence calculator portion 12
calculates how many times each of the 2-mer arrays appears
(incidence) in a genome array.
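
The counting described for Fig. 2A can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the incidence calculator portion 12; the function name `count_incidences` is an assumption introduced here.

```python
from collections import Counter


def count_incidences(genome: str, n: int) -> Counter:
    """Count how often each n-mer partial array occurs in the genome array.

    Hypothetical helper: slides a window of length n over the genome and
    tallies every window. Arrays that never occur are simply absent.
    """
    return Counter(genome[i:i + n] for i in range(len(genome) - n + 1))


incidences = count_incidences("ATATGGGATC", 2)
# "AT" occurs three times, "GG" twice, and "AA" not at all.
print(incidences["AT"], incidences["GG"], incidences["AA"])
```

A `Counter` returns 0 for keys it has never seen, which matches the treatment of arrays such as "AA" that do not appear.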

Next, an isolation degree will be described with
reference to the same genome array and 2-mer arrays as those
of the example in Fig. 2A. Fig. 3 is a diagram for describing
an isolation degree of each of the 2-mer arrays (partial
arrays) with respect to the genome array. An isolation degree
is defined herein as a minimum Hamming distance between arrays.
In other words, the isolation degree of the partial array is
"n" where a partial array in which alkalis at n positions are
replaced by other alkalis appears in the genome array though
no partial array in which alkalis at i positions (i < n) are
replaced by other alkalis appears in the genome array.

In the example shown in Fig. 3, those resulting from
replacement of an alkali at one position of the partial array
"AT" (one mutation, refer to the reference numeral 302), that
is, "AA", "AG", "AC", "TT", "GT" and "CT" do not appear in the
genome array (refer to the reference numeral 300). On the
other hand, the ones underlined in Fig. 3 ("TA", "TC", "GA"
and "GG") of those resulting from replacement of alkalis at
two positions thereof (two mutations, refer to the reference
numeral 303), that is, "TA", "TG", "TC", "GA", "GG", "GC", "CA",
"CG" and "CC" appear in the genome array (refer to the
reference numeral 300). Therefore, the isolation degree of
the partial array "AT" is "2".

Similarly, since one mutation of each of the other
partial arrays "TA", "TG", "GG", "GA" and "TC" occurs in the
genome array, the isolation degree of each of them is "1".
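
The isolation degrees in the Fig. 3 example can be checked with a brute-force sketch that takes, for each partial array, the minimum Hamming distance to every different partial array occurring in the genome array. The names `hamming` and `isolation_degrees` are assumptions made for this illustration, and the exhaustive comparison is only practical for tiny examples such as this one.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length arrays differ."""
    return sum(x != y for x, y in zip(a, b))


def isolation_degrees(genome: str, n: int) -> dict:
    """Map each n-mer occurring in the genome to its isolation degree.

    The isolation degree is taken as the minimum Hamming distance to any
    *different* n-mer that occurs in the genome (None if there is none).
    """
    observed = {genome[i:i + n] for i in range(len(genome) - n + 1)}
    return {
        e: min((hamming(e, o) for o in observed if o != e), default=None)
        for e in observed
    }


degrees = isolation_degrees("ATATGGGATC", 2)
print(degrees["AT"], degrees["TA"])
```

For the genome array "ATATGGGATC" this gives 2 for "AT" and 1 for the other five partial arrays, in agreement with the text.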

According to this embodiment, an incidence and an
isolation degree are calculated by using an 18-mer partial
array, for example. Fig. 4 is a flowchart showing an overview
of processing to be performed for calculating an incidence
according to this embodiment. As shown in Fig. 4, the
incidence calculator portion 12 selects an N-mer array (such
as an 18-mer array) (step 401) and scans a genome array
obtained from the genome array DB 22 (step 402). Thus,
positions where the N-mer array appears can be located in the
genome array, and the incidence can be obtained by counting
them (step 403).

The incidence calculator portion 12 stores the
incidence and so on, which are obtained at the step 403, in
the incidence/isolation degree table 16 by using the given
N-mer array to be processed as a key (step 404). The
above-described processing is performed on all possible N-mer
arrays (refer to a step 405). Thus, a table can be created.

In reality, according to this embodiment, by omitting
N-mer arrays with an incidence of 0 from the table, the size
of the table can be reduced. For example, in the example shown
in Fig. 2A, the table including incidences is originally as
shown in Fig. 5A. However, according to this embodiment, the
size is reduced as shown in Fig. 5B. The table may be called
a "map-size table".

Furthermore, by limiting the part of a table to be
referred to, the speed of processing is increased. For example,
in the example shown in Fig. 2A, by using a 2-mer array itself
as a key, a table 501 as shown in Fig. 5B can be obtained.
However, in order to increase the speed of processing, a
sub-table 502 may be provided in which an alkali array having
a shorter length (which may be called "hash size") is used as
a key as shown in Fig. 5B. Thus, the approximate position
thereof in the table can be located. This kind of sub-table may
be called a "hash-size table". When a map-size table for N=18
is used, the hash size is desirably 14, for example.
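
One possible reading of the map-size table and hash-size table is sketched below. It is an assumption-laden illustration: the map-size table is modeled as a sorted list of (array, incidence) pairs without zero-incidence entries, and the hash-size table as a dictionary from each leading hash-size array to the first position in the map-size table at which it occurs. The names `build_tables` and `lookup` are hypothetical.

```python
from collections import Counter


def build_tables(genome: str, n: int, hash_size: int):
    """Build a sorted map-size table and a hash-size table of start positions."""
    counts = Counter(genome[i:i + n] for i in range(len(genome) - n + 1))
    # Only arrays that actually occur are stored, so zero-incidence
    # arrays never take up space in the map-size table.
    map_table = sorted(counts.items())
    hash_table = {}
    for pos, (array, _) in enumerate(map_table):
        # Remember the first row whose leading hash_size alkalis match.
        hash_table.setdefault(array[:hash_size], pos)
    return map_table, hash_table


def lookup(map_table, hash_table, array: str, hash_size: int) -> int:
    """Return the incidence of `array`, or 0 if it is not in the table."""
    prefix = array[:hash_size]
    pos = hash_table.get(prefix)
    if pos is None:
        return 0
    # Scan only the rows that share the same leading hash_size alkalis.
    while pos < len(map_table) and map_table[pos][0][:hash_size] == prefix:
        if map_table[pos][0] == array:
            return map_table[pos][1]
        pos += 1
    return 0


map_table, hash_table = build_tables("ATATGGGATC", 2, 1)
print(lookup(map_table, hash_table, "GG", 1))
```

Because the map-size table is sorted, all arrays sharing a leading hash-size array are adjacent, so a lookup touches only that small block instead of the whole table.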

Next, the isolation degree calculator portion 14
calculates the isolation degree by referring to the map-size
table and sub-table (hash-size table) in which the N-mer
arrays and incidences as a result of the processing in Fig.
4 correspond to each other. Fig. 6 is a flowchart describing
processing of calculating an isolation degree.

First, the isolation degree calculator portion 14
selects an N-mer array (step 601) and initializes "i",
indicating the number of mutations, to 1 (step 602). Next,
another N-mer array, which is i mutation(s) of the N-mer array,
is selected (step 603). The isolation degree calculator
portion 14 refers to the hash-size table (step 604) and judges
whether or not the other N-mer array, that is, the first alkali
array with a hash-size length appears in the genome array (step
605).

If No at the step 605 and if another N-mer array having
i mutation(s) remains (No at a step 606), the processing at
the steps 603 to 605 is performed on the other N-mer array.
Alternatively, if the appearance in the genome array of all
other N-mer arrays having i mutation(s) has been judged, i is
incremented (step 607). Then, the same processing (steps 603
to 606) is repeated for the incremented i mutation(s).

Here, a technique of identifying N-mer arrays having
i mutation(s) and referring to the table (steps 603 and 604)
will be described. According to this embodiment, in reality,
at the step 603, an array in a hash-size (hash array) in which
a predetermined number of alkalis from the beginning in the
N-mer array selected at the step 601 are the same is identified.
Then, an array in which i alkalis of the alkalis are converted
is created, and how many N-mer arrays including the hash array
in the beginning exist is obtained with reference to the
hash-size table and, then, the map-size table. Thus, a list
of the N-mer arrays can be obtained.

Next, the isolation degree calculator portion 14
calculates a Hamming distance between each of the resulting
N-mer arrays and the N-mer array to be processed (the one
selected at the step 601) and judges whether the minimum value
of the Hamming distance is equal to i or not (step 608). This
is because no calculation is required for the rest if the
minimum value is i since all of the listed N-mer arrays include
i mutation(s) of the N-mer array to be processed.

If judged as Yes at the step 608, i is stored in the table
as the isolation degree of the N-mer array to be processed.
On the other hand, if judged as No at the step 608, that is,
if the minimum value of the Hamming distance is larger than i,
(i+x) mutations (x >= 1) exist. Thus, i is incremented, and the
steps 603 and 604 are repeated. Then, it is judged whether the
minimum value of the Hamming distance between the listed N-mer
arrays and the N-mer array to be processed is equal to i or
not. Therefore, the isolation degree of each of the N-mer
arrays can be calculated without requiring a large amount of
processing time.
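
The loop over i in Fig. 6 can be illustrated with a simplified sketch. It is not the flowchart itself: instead of the hash-array shortcut of steps 603 and 604, it enumerates every i-mutation of the array and looks each one up directly in an incidence table. The names `mutants` and `isolation_degree`, and the four-letter alphabet "ACGT", are assumptions for this example.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACGT"  # assumed alkali alphabet


def mutants(array: str, i: int):
    """Yield every array that differs from `array` at exactly i positions."""
    for positions in combinations(range(len(array)), i):
        # For each chosen position, every alkali except the original one.
        choices = [[b for b in ALPHABET if b != array[p]] for p in positions]
        for replacement in product(*choices):
            letters = list(array)
            for p, b in zip(positions, replacement):
                letters[p] = b
            yield "".join(letters)


def isolation_degree(array: str, incidence: dict):
    """Smallest i for which some i-mutation of `array` has a nonzero incidence."""
    for i in range(1, len(array) + 1):
        if any(incidence.get(m, 0) > 0 for m in mutants(array, i)):
            return i
    return None  # no mutation of the array appears at all


genome = "ATATGGGATC"
table = Counter(genome[k:k + 2] for k in range(len(genome) - 1))
print(isolation_degree("AT", table), isolation_degree("TA", table))
```

The number of i-mutations grows quickly with i and N, which is why the embodiment narrows the search with the hash-size table rather than enumerating all candidates as this sketch does.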

The visualization processing portion 18 visualizes
alkalis in the genome array and creates an image by using the
incidence/isolation degree table 16 resulting from the
processing in Fig. 4 and Figs. 5A to 5C. The technique will
be described below. For example, according to this embodiment,
an indicator relating to an incidence of each alkali is
obtained based on the incidence of the N-mer array. Fig. 2B
is a diagram illustrating a method of calculating an indicator
of an element "A (see the arrow 212)" in a genome array (the
reference numeral 211) when N=6. Here, it is assumed that N
arrays include the element "A", and a first indicator is
obtained by calculating (a total sum of incidences of N
arrays)/N.

In the example in Fig. 2B, 6-mer arrays "ATGCCA",
"TGCCAG", "GCCAGT", "CCAGTC", "CAGTCA" and "AGTCAG" appear
eight times, twice, three times, once, three times and four
times, respectively, in the genome array. Therefore, the
indicator to obtain is (8+2+3+1+3+4)/6 = 3.5. The speed of the
indicator calculation can be increased by referring to the
incidence/isolation degree table.
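
The first indicator can be sketched as a per-alkali average over the N-mer arrays that cover each position. The sketch below is illustrative only; the name `first_indicator` is hypothetical, and dividing by the number of available windows near the ends of the genome array (where fewer than N arrays cover an alkali) is an assumption not stated in the text.

```python
from collections import Counter


def first_indicator(genome: str, n: int) -> list:
    """Average incidence of the n-mer arrays covering each alkali position.

    Assumes len(genome) >= n. Near the ends fewer than n windows exist,
    so the sum is divided by the number of windows actually available.
    """
    counts = Counter(genome[i:i + n] for i in range(len(genome) - n + 1))
    indicators = []
    for p in range(len(genome)):
        # Start positions of all windows of length n that contain position p.
        starts = range(max(0, p - n + 1), min(p, len(genome) - n) + 1)
        total = sum(counts[genome[s:s + n]] for s in starts)
        indicators.append(total / len(starts))
    return indicators


print(first_indicator("ATATGGGATC", 2))
# The Fig. 2B arithmetic for one alkali covered by six 6-mer arrays:
print((8 + 2 + 3 + 1 + 3 + 4) / 6)
```

A second indicator could be formed the same way by summing isolation degrees instead of incidences.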

An isolation degree can be handled similarly. A second
indicator relating to an isolation degree of each alkali can
be obtained. Also in this case, the speed of the indicator
calculation can be increased with reference to the
incidence/isolation degree table.

For example, the visualization processing portion 18
determines a color to be assigned to each alkali for displaying
the genome array based on the first indicator and the second
indicator or a third indicator, which is a combination of the
first indicator and the second indicator. According to this
embodiment, as the incidence decreases, that is, as the value
of the first indicator decreases, the possibility that the
array containing the alkali is a unique primer increases. On
the other hand, as the isolation degree increases, that is,
as the value of the second indicator increases, the possibility
that the array containing the alkali is a unique primer
increases. By using these facts, it may be set that, where the
third indicator = (second indicator / first indicator), the
possibility that the array containing the alkali is a unique
primer increases as the value of the third indicator
increases.

As shown in Fig. 7, the visualization processing portion
18 defines that, as the value of the first indicator decreases,
the level of coldness of the color increases while as the value
increases, the level of warmness of the color increases.
Alternatively, the visualization processing portion 18
defines that, as the value of the second indicator increases,
the level of coldness of the color increases while as the value
decreases, the level of warmness of the color increases. In
accordance with the setting, the visualization processing
portion 18 assigns a color to each alkali. Of course, as the
value of the third indicator increases, the level of coldness
of the color increases, as shown in Fig. 7.
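
As a rough illustration of such a color assignment, the sketch below forms the third indicator and maps it linearly onto a hue running from red (warm) to blue (cold). The linear scale, the hue range, and the names `third_indicator` and `indicator_to_rgb` are all assumptions made for this example; Fig. 7 may define the colors differently.

```python
import colorsys


def third_indicator(first: float, second: float) -> float:
    """Third indicator = second indicator / first indicator."""
    return second / first


def indicator_to_rgb(value: float, lo: float, hi: float) -> tuple:
    """Map a third-indicator value in [lo, hi] to an (R, G, B) color.

    Low values give a warm color (red) and high values a cold color (blue).
    Values outside [lo, hi] are clamped. The linear scale is an assumption.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    t = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    hue = t * (2.0 / 3.0)  # 0 is red, 2/3 is blue on the HSV hue circle
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return round(r * 255), round(g * 255), round(b * 255)


value = third_indicator(first=3.5, second=2.0)
print(indicator_to_rgb(value, lo=0.0, hi=2.0))
```

The same mapping could be applied to the first indicator with the scale reversed, since a lower incidence corresponds to a colder color.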

In this way, when an image colored in consideration of
the incidence and/or isolation degree of each alkali in the
genome array is displayed on the screen of the display
apparatus 24, a user can intuitively identify primer
candidates, which may be more unique, with reference to the
color given to the genome array in the image.
The primer creation supporting portion 20 includes a tool for
checking the presence of the formation of a complementary
chain in an array and/or a melting temperature and a tool
(program) for achieving an optimum GC content and avoiding a
short repeated array and/or a palindrome array. Thus,
processing required in accordance with an instruction from a
user can be performed on a primer candidate selected by the
user. Therefore, the user can design a predetermined primer.

According to this embodiment, alkalis in a genome array
can be visualized based on incidences in the genome array of
an array with a predetermined length (N) and an isolation
degree of the array with the predetermined length with respect
to the genome array. Therefore, a user can intuitively and
visually check an array including a more unique alkali. During
the calculation of an incidence and an isolation degree, a
processing time required for the visualization is reduced by
using the incidence/isolation degree table. Furthermore, a
processing time for creating an isolation degree relating to
the array with the predetermined length can be reduced by using
a hash table relating to an array with a shorter length than
N.

Of course, the invention is not limited to the
embodiment, and various changes and modifications may be made
without departing from the spirit and scope of the invention.
It will be understood that the changes and modifications fall
within the spirit and scope of the invention.

For example, according to the embodiment, both a
map-size table relating to an array with a predetermined
length (N) and a hash table relating to a shorter array are
created as tables relating to incidences. By using them,
an isolation degree relating to the array with the
predetermined length can be calculated, and/or an incidence
for creating an indicator can be identified, for example.
However, the invention is not limited to these constructions.
Only a map-size table may be provided, and the processing may
be performed by a so-called binary search.

In the embodiment, the map size is 18 (N=18), and the
hash size is 14. However, the sizes are not limited thereto.
Of course, tables relating to arrays having other sizes may
be created.

Furthermore, an indicator for each alkali is not limited
to the one according to the embodiment. The visualization
technique based on an indicator is not limited to the one
according to the embodiment, either.

While a different color is assigned in accordance with
an indicator in the embodiment, the assignment is not limited
thereto. A different lightness of grayscale may be assigned.
Alternatively, a different display form may be assigned in
accordance with an indicator.

Furthermore, according to the embodiment, the primer
design support system includes the incidence calculator
portion 12 and the isolation degree calculator portion 14 and
creates a table indicating incidences and isolation degrees
based on an array from the genome array DB 22. The created
table is used by the visualization processing portion 18.
However, not all of them are required. For example, a table
may be created by a system including the incidence calculator
portion 12 and the isolation degree calculator portion 14, and
the table may be recorded in a recording medium such as a CD-ROM
or a DVD-ROM. In this case, a system having the visualization
processing portion 18 may read the recording medium and
implement processing for assigning a different color in
accordance with an indicator relating to a given alkali, for
example.

According to the embodiment, the invention is applied
for supporting design of a primer such as a PCR primer. However,
the invention may be also applied for design of microarray
oligonucleotides, array design for RNAi, array design for gene
screening, and array design for genome typing. Therefore, the
"primer" herein may include an oligomer array.

[Second Embodiment] 

Next, a second embodiment of the invention will be
described. Before describing a construction and processing
of a system, a principle of the second embodiment will be
described below. According to the second embodiment, a second
isolation degree (that is, extended isolation degree) in which
the concept of an isolation degree is extended is introduced,
and various kinds of calculation are performed by using the
second isolation degree. Again, an isolation degree will be
described briefly, and the second isolation degree in which
the isolation degree is extended will be described.

[Second Isolation Degree]

"G" refers to a genome array having a length |G| here.
For example, for a human genome, |G| is equal to about 3 Gbp.
A partial array "E" thereof is an array having a length
|E|. Here, the array E is a short array. For example,
when the partial array E of the genome array G appears in the
genome array G only once, the isolation degree of "E" with
respect to "G" is the minimum value of the number of mismatched
alkalis as a result of the comparison between "E" and all of
the partial arrays of "G" (where the original array is
excluded). As the isolation degree increases, the possibility
that E couples with a wrong place (inappropriate place)
decreases.

A partial array from an i-th letter to an r-th letter
in the genome array G is written as G[i,r]. A Hamming distance
between an array S and an array T is written as d_H(S,T).
Therefore, the Hamming distance d_H(S,T) may be expressed as:

$$d_H(S,T) = \left|\{\, i \mid S[i,i] \neq T[i,i],\ i = 1, \ldots, k \,\}\right|$$

Here, the isolation degree isol(E,G) of the partial
array E with respect to the genome array G may be defined as:

$$\mathrm{isol}(E,G) = \min\{\, d_H(E, G[i,i+k-1]) \mid k = |E|,\ i = 1, \ldots, |G|-k+1,\ E \neq G[i,i+k-1] \,\}$$

For example, when the genome array S is "ATGCTGCGATCGTA"
and the genome array T is "ATGTTGCGATCCTA", the Hamming
distance between the genome array S and the genome array T is
"2". When the genome array G is the same as the array S and
the partial array E is the array "ATGCT" consisting of the first
five elements of the genome array S, isol(E,G) = 2.
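
The two definitions above can be checked against the worked example
with a short brute-force sketch. This is only an illustration, not
the calculation technique of the embodiment; the function names
`hamming` and `isol`, and the convention of identifying E by a
0-based start position and a length within G, are assumptions made
for this sketch.

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length arrays differ."""
    if len(s) != len(t):
        raise ValueError("arrays must have the same length")
    return sum(a != b for a, b in zip(s, t))


def isol(genome: str, start: int, length: int) -> int:
    """Isolation degree of E = genome[start:start+length] w.r.t. genome.

    E is compared with every window of the same length, except the
    window at `start` itself (the "original array").
    """
    e = genome[start:start + length]
    return min(
        hamming(e, genome[i:i + length])
        for i in range(len(genome) - length + 1)
        if i != start
    )


S = "ATGCTGCGATCGTA"
T = "ATGTTGCGATCCTA"
print(hamming(S, T))  # 2, as in the example above
print(isol(S, 0, 5))  # 2, for E = "ATGCT"
```

The brute-force scan costs O(|G| |E|) per query, which is why the
text goes on to look for faster bounds.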

Next, the extended second isolation degree will be
described. When all of the arrays with the length |E| included
in the genome array G are sorted in order of increasing Hamming
distance with respect to the partial array E, the n-th array
is the n-neighbor of the array E and is written as
neighbor_n(E,G). Here, the second isolation degree, that is, the
extended isolation degree isol_n(E,G), is defined as:

$$\mathrm{isol}_n(E,G) = d_H(E, \mathrm{neighbor}_n(E,G))$$

Here, isol_1(E,G) is the above-described isolation degree.
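
As a rough illustration of the n-neighbor definition, the following
brute-force sketch sorts the Hamming distances of all windows and
picks the n-th one. It assumes that the window occupied by E itself
is excluded from the sorted list, so that the case n = 1 coincides
with the ordinary isolation degree; the function names and the
(start, length) convention are hypothetical.

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length arrays differ."""
    return sum(a != b for a, b in zip(s, t))


def isol_n(genome: str, start: int, length: int, n: int) -> int:
    """Hamming distance from E to its n-th nearest window in genome.

    Assumption: the window at `start` (E itself) is left out, so
    isol_n(..., n=1) equals the ordinary isolation degree.
    """
    e = genome[start:start + length]
    distances = sorted(
        hamming(e, genome[i:i + length])
        for i in range(len(genome) - length + 1)
        if i != start
    )
    return distances[n - 1]


G = "ATGCTGCGATCGTA"
for n in (1, 2, 3):
    print(n, isol_n(G, 0, 5, n))  # prints 2, 2 and 3 for E = "ATGCT"
```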

For calculating the second isolation degree, a suffix
array is used. This will be described briefly. The array
G[1,...,n] = G[1]G[2]...G[n] will be considered. Here,
G[n] = $ is an end letter that is larger than all of the other
elements. The j-th suffix of G is defined as G[j,...,n]. This
is written as G_j. The string G[j,...,l] is called a prefix of
G_j. The suffix array SA[1,...,n] is an array of the integers j,
each corresponding to a suffix G_j, arranged such that the
suffixes are sorted in dictionary order (such as in alphabetical
order in this example). When the length of the longest common
prefix between the strings s and t is |lcp(s,t)|, a height array
Hgt[1,...,n] is defined as:

$$\mathrm{Hgt}[i] = |\mathrm{lcp}(T_{SA[i]}, T_{SA[i+1]})|$$

where T_{SA[i]} denotes the suffix starting at the position
SA[i]. Here, Hgt[1] = 0 is defined. By using this array, a length
causing the incidence of the prefix of T_{SA[i]} to be "1" for
the first time in the string G can be obtained as:

$$\mathrm{maxHgt}[i] = 1 + \max\{\mathrm{Hgt}[i-1], \mathrm{Hgt}[i]\}$$

where the length is maxHgt[i]. Here, maxHgt[1] = 1 + Hgt[1].
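
The suffix array, the height array, and maxHgt can be sketched as
below for a small string. This is a naive illustration that sorts
the suffixes directly, not a construction suited to a genome-sized
input. Three points are assumptions of the sketch: indices are
0-based, the end letter is written "~" because it sorts after the
letters A, C, G, T in ASCII, and a missing neighbor at either end of
the suffix array is treated as a common prefix of length 0.

```python
def suffix_array(text: str) -> list:
    """0-based start positions of all suffixes, in dictionary order."""
    return sorted(range(len(text)), key=lambda j: text[j:])


def lcp_len(s: str, t: str) -> int:
    """Length of the longest common prefix of s and t."""
    n = 0
    for a, b in zip(s, t):
        if a != b:
            break
        n += 1
    return n


def height_array(text: str, sa: list) -> list:
    """hgt[i] = |lcp| of the suffixes at ranks i and i+1 (0 at the last rank)."""
    hgt = [0] * len(sa)
    for i in range(len(sa) - 1):
        hgt[i] = lcp_len(text[sa[i]:], text[sa[i + 1]:])
    return hgt


def max_height_array(hgt: list) -> list:
    """maxHgt[i] = 1 + max(hgt[i-1], hgt[i]); a missing neighbor counts as 0."""
    return [
        1 + max(hgt[i - 1] if i > 0 else 0, hgt[i])
        for i in range(len(hgt))
    ]


G = "ATGCTGCGATCGTA" + "~"  # "~" plays the role of the end letter $
sa = suffix_array(G)
hgt = height_array(G, sa)
max_hgt = max_height_array(hgt)
for rank, start in enumerate(sa):
    # Shortest prefix of each suffix that occurs only once in G.
    print(start, G[start:start + max_hgt[rank]])
```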
[Technique of Calculating Isolation Degree] 

An isolation degree of the partial array E with respect
to the genome array G can be obtained by scanning G only once.
The calculation requires a period of time
O(|G| (|E| log |E|)^{1/2}). When the maximum number k of
mismatches is given, the calculation time is
O(|G| (k log k)^{1/2}). The inventors know that the second
isolation degree isol_n(E,G) can be calculated in a period of
time O(|G| (|E| log |E|)^{1/2}). However, for a human genome,
|G| has a size of about 3×10^9. Therefore, a further reduction
of the calculation time is required.

Therefore, the inventors devised a technique of calculating
the lower bound of the isolation degree of a given array E with
respect to G by using a sub-table that stores the isolation
degrees of as many short partial arrays as a memory can hold.
[Introduction of Divided String] 

A division dec(E,L) of an array E is defined as a set
of partial arrays resulting from the division of the array E
into m parts such that the lengths of the partial arrays are
uniquely L_i (where i = 1, ..., m).

(1) 

The i-th partial array is defined as dec_i(E,L).
The inventors found that, when an array E was given, the
following equation held for a given division dec(E,L).
(2) 

Furthermore, the system holds a table of isolation
degrees relating to partial arrays having lengths of p (such as
18 mer) and below. Here, the following equation holds from
the equation above.
(3)

where the equal signs hold when p = |E|.

In order to calculate the left side of the inequality, 
the following technique can be adopted. 

A function f(E) is defined as:

(4) 

Based on this, the following linear recurrence equation
can be obtained, and a lower bound f(E) of the isolation degree
can be calculated in a period of time O(|E| |p|).

$$f(E[1,i]) = \mathrm{isol}_n(E[1,i], G) \quad (i = 1, \ldots, p)$$

$$f(E[1,i]) = \max\{\, f(E[1,i-j]) + \mathrm{isol}_n(E[i-j+1,i], G) \mid j = 1, \ldots, p \,\} \quad (\text{a recursive step where } i > p)$$

By solving the recurrence equation for an array E, a lower
bound of the isolation degree isol_n(E,G) can be obtained.
Furthermore, when isol_n(E[1,i],G) (where i = 1, ..., p) can be
calculated in a constant time by using a sub-table, which will
be described below, the recurrence equation above can be
calculated in a period of time O(|E| |p|).
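
The recurrence can be sketched as a small dynamic program. The
sketch below is an illustration under stated assumptions rather than
the apparatus itself: it covers only the case n = 1, the sub-table is
replaced by a brute-force function `isol`, and E is identified by a
hypothetical 0-based (start, length) pair within G so that each piece
excludes only its own position. All function names are assumptions.

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length arrays differ."""
    return sum(a != b for a, b in zip(s, t))


def isol(genome: str, start: int, length: int) -> int:
    """Brute-force isolation degree; stands in for the sub-table lookup."""
    e = genome[start:start + length]
    return min(
        hamming(e, genome[i:i + length])
        for i in range(len(genome) - length + 1)
        if i != start
    )


def lower_bound(genome: str, start: int, length: int, p: int) -> int:
    """Lower bound f(E) built from pieces of length p or below."""
    f = [0] * (length + 1)  # f[i] is the bound for the first i letters of E
    for i in range(1, length + 1):
        if i <= p:
            # Base case: the prefix is short enough to be looked up directly.
            f[i] = isol(genome, start, i)
        else:
            # Recursive step: append a last piece of length j = 1, ..., p.
            f[i] = max(
                f[i - j] + isol(genome, start + i - j, j)
                for j in range(1, p + 1)
            )
    return f[length]


G = "ATGCTGCGATCGTA"
print(lower_bound(G, 0, 5, 2), "<=", isol(G, 0, 5))
```

With p equal to the length of E only the base case is used, so the
bound coincides with the brute-force isolation degree.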
[Sub-Table]

In order to calculate the lower bound of the isolation
degree of the array E, all isolation degrees of partial arrays
having a length |p| and below must be calculated. According
to this embodiment, a suffix array and a height array are used.
While the maximum height array maxHgt[i] has been described
above, this can be regarded as a length by which the incidence
of the i-th prefix in the suffix array is one or below in the
array G for the first time. Extending this definition, the
maximum height array maxHgt_k[i] is defined as
"a length by which the incidence of the i-th prefix of a given
suffix array is k or below in the array G for the first time".
In order to calculate this maximum height array,
the definition of a height array is extended as:

$$\mathrm{Hgt}_k[i] = |\mathrm{lcp}(T_{SA[i]}, T_{SA[i+k]})|$$

By using the height array Hgt_k, the maximum height array
maxHgt_k[i] can be obtained.

$$\mathrm{maxHgt}_k[i] = 1 + \max\{\, \mathrm{Hgt}_k[i-j] \mid j = 0, \ldots, k \,\}$$
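
The extended arrays can be sketched as follows for a small string.
This is a naive illustration, not a genome-scale implementation. It
assumes 0-based indices, writes the end letter as "~" so that it
sorts after A, C, G, T in ASCII, and treats any Hgt_k entry whose
ranks fall outside the suffix array as 0; the function names are
hypothetical.

```python
def suffix_array(text: str) -> list:
    """0-based start positions of all suffixes, in dictionary order."""
    return sorted(range(len(text)), key=lambda j: text[j:])


def lcp_len(s: str, t: str) -> int:
    """Length of the longest common prefix of s and t."""
    n = 0
    for a, b in zip(s, t):
        if a != b:
            break
        n += 1
    return n


def max_height_k(text: str, sa: list, k: int) -> list:
    """maxHgt_k[i] = 1 + max(Hgt_k[i-j] for j = 0, ..., k)."""
    n = len(sa)

    def hgt_k(i: int) -> int:
        # Hgt_k[i] compares the suffixes at ranks i and i+k;
        # ranks outside the array contribute 0 (assumed boundary rule).
        if i < 0 or i + k >= n:
            return 0
        return lcp_len(text[sa[i]:], text[sa[i + k]:])

    return [1 + max(hgt_k(i - j) for j in range(k + 1)) for i in range(n)]


G = "ATGCTGCGATCGTA" + "~"
sa = suffix_array(G)
for k in (1, 2):
    print(k, max_height_k(G, sa, k))
```

For k = 1 this reduces to the maximum height array maxHgt described
earlier.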

When a data structure is used in which the number of
elements under each node of a suffix tree is written in the
node, the maximum height array maxHgt_k can be calculated in
a period of time O(|G|) by traversing the tree in a
depth-first manner. However, since 16n bytes are required
for storing a suffix tree, a memory capacity of 48 Gbytes is
required for storing a suffix tree of a human genome array (3
G (giga) bytes). Since 6 bytes are required for storing each
node in which the number of leaves under the node of the suffix
tree is limited to 2^8 or below, 54 Gbytes are required in total.
On the other hand, 4n bytes are required for a suffix array.
The suffix array of a human genome array (3 G (giga) bytes) can
be stored in 12 Gbytes. Even when a height array is stored for a
length equal to or lower than 2^8, only 15 Gbytes are required in
total. Therefore, a suffix array is desirably used in
consideration of the memory capacity.
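
The memory figures above follow from simple arithmetic, reproduced
here as a small check. The value of one byte per height entry is an
assumption of this sketch, inferred from the difference between the
12 Gbyte and 15 Gbyte figures; the variable names are hypothetical.

```python
n = 3 * 10**9   # letters in a human genome array (about 3 G)
GIGA = 10**9

suffix_tree_bytes = 16 * n   # 16n bytes for a suffix tree
suffix_array_bytes = 4 * n   # 4n bytes for a suffix array
height_bytes = 1 * n         # assumption: one byte per height entry

print(suffix_tree_bytes // GIGA)                    # 48
print(suffix_array_bytes // GIGA)                   # 12
print((suffix_array_bytes + height_bytes) // GIGA)  # 15
```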

By using the maximum height array maxHgt_k, a partial
array E with a length l starting from a position i on the genome
array G and an isolation degree isol_k(E,G) thereof can be
categorized as:

$$\mathrm{isol}_k(E,G) \begin{cases} = 0 & (l < \mathrm{maxHgt}_k[SA[i]]) \\ = 1 & (l = \mathrm{maxHgt}_k[SA[i]]) \\ \geq 1 & (l > \mathrm{maxHgt}_k[SA[i]]) \end{cases}$$

All of the isolation degrees isol_k(E,G) of the partial arrays
E having a length |p| and below starting from all positions
in the genome array G must be calculated. However, for
maxHgt_k[SA[i]] ≥ |p|, the isolation degree is "0" or "1". Thus,
it can be obtained in a constant time from the maximum height
array maxHgt_k and the suffix array. In order to calculate the
maximum height accurately for all of the partial arrays E having
a length |p| and below, a separate calculation must be performed
for maxHgt_k[SA[i]] < |p|. However, the step can be
omitted in consideration of the calculation of a lower bound
of the isolation degrees.

When the accurate calculation of the maximum height is
not performed, a length |p| is desirably used by which the
isolation degree isol_k(E,G) of partial arrays E having the
length |p| and below substantially agrees with the value
resulting from the categorization above.
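
The categorization can be written as a constant-time lookup. The
sketch below simply restates the three cases given in the text; the
function name `categorize` and the (value, exact) return convention
are assumptions made for illustration.

```python
def categorize(length: int, max_hgt: int) -> tuple:
    """Return (value, exact) for a partial array of the given length.

    `max_hgt` is the maxHgt_k entry for the position where the array
    starts. When `exact` is False, `value` is only a lower bound.
    """
    if length < max_hgt:
        return 0, True    # categorized as isolation degree 0
    if length == max_hgt:
        return 1, True    # categorized as isolation degree 1
    return 1, False       # categorized as 1 or more: a lower bound


print(categorize(16, 18))
print(categorize(18, 18))
print(categorize(20, 18))
```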

Fig. 8 is a graph showing a distribution of the maximum
height maxHgt in a human genome array. In Fig. 8, the horizontal
axis indicates the value of the maximum height array maxHgt
while the vertical axis of the solid-line graph indicates the
number of positions in a human genome array where the maximum
height is x. The vertical axis of the dotted-line graph
indicates the proportion of the number of positions in a human
genome array with the maximum height equal to or lower than
x in the entire array.

Referring to Fig. 8, when an array with a length of 16
(16 mer) is selected in a human genome array, the array may
be unique (that is, the incidence is one) with a probability
of 22%. On the other hand, an array with a length of 20 (20
mer) is unique with a probability of 74%. Apparently, using
the incidence in the entire array as an indicator of
uniqueness is not appropriate for arrays with a length
of 20 mer or higher.

Referring to Fig. 8, most arrays congregate at the
maximum height of 17 mer. Here, when a length whereby the
maximum height is "2" for the first time is calculated, the
pattern peak may be estimated as being at a length of 18 mer
or more. The part of the arrays having the isolation degree of
"0" or "1" can be accurately calculated by using maxHgt_k. Thus,
a desired length |p| is achieved when the isolation degrees
of most arrays having the length |p| or below are "0" or "1".
Therefore, |p| = 18 is the most suitable for a human genome array.
In order to search for |p|, a lower bound is calculated by moving
p in the recurrence equation in the range of 1 < |p| < |E|.
However, the lower bound is calculated from maxHgt_k without
accurately calculating the isolation degrees with the length
|p|. This technique is equivalent to solving the above-described
recurrence equation by using a value obtained from maxHgt_k with
p = |E|. Though the amount of time for the calculation of the
lower bound is O(|E|^2), the calculation speed is sufficiently
high in consideration of the fact that |E| is about 60 with
|G| > 10^9.
[Processing Example] 

Fig. 9 is a flowchart showing an overview of processing
to be performed in the design support apparatus according to
the second embodiment. As shown in Fig. 9, maximum height
arrays (maxHgt_1(i), maxHgt_2(i), ...) are first calculated in
the design support apparatus (step 901). For example,
maxHgt_1(i) and maxHgt_2(i) can be obtained from the following
equations. As described above, maxHgt_k(i) refers to the
length by which a string (partial array) starting from the
position i occurs k times or below in the genome array G for
the first time.

$$\mathrm{maxHgt}_1(i) = \max(|\mathrm{lcp}(i,i+1)|,\ |\mathrm{lcp}(i,i-1)|) + 1$$

$$\mathrm{maxHgt}_2(i) = \max(|\mathrm{lcp}(i,i+2)|,\ |\mathrm{lcp}(i-1,i+1)|,\ |\mathrm{lcp}(i-2,i)|) + 1$$

Next, a table is prepared having a calculation result
of the second isolation degree of a partial array having a
length |p| (such as |p| = 18) or below (step 902). Then, by using
the above-described recurrence equation,

$$f(E[1,i]) = \mathrm{isol}_n(E[1,i], G) \quad (i = 1, \ldots, p)$$

$$f(E[1,i]) = \max\{\, f(E[1,i-j]) + \mathrm{isol}_n(E[i-j+1,i], G) \mid j = 1, \ldots, p \,\} \quad (\text{a recursive step where } i > p)$$

the lower bound f(E) is calculated (step 903).

Industrial Applicability 

The invention is applicable to supporting the design of
oligonucleotide arrays using a unique array, for example. The
unique array can be obtained from a large amount of array
information (such as human genome arrays), which may be useful
for PCR primer design, design of microarray oligonucleotides,
array design for RNAi, array design for genetic screening, and
array design for genome typing.